State values

The state-value function measures how good it is, in the long run, to start from a given state under a given policy.

Basic definition

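Consider an MDP, which satisfies the Markov property, and a sequence of time steps $t = 0, 1, 2, \dots$ . At time step $t$ the agent is in state $S_{t}$ , selects action $A_{t}$ according to policy $π$ , and receives the next state $S_{t + 1}$ and the immediate reward $R_{t + 1}$ :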

S_{t} \overset{A_{t}}{\to} S_{t + 1}, R_{t + 1}

$S_{t}, S_{t + 1}, A_{t}, R_{t + 1}$ are all random variables. The state-action-reward trajectory starting from $t$ is:

S_{t} \overset{A_{t}}{\to} S_{t + 1}, R_{t + 1} \overset{A_{t + 1}}{\to} S_{t + 2}, R_{t + 2} \overset{A_{t + 2}}{\to} S_{t + 3}, R_{t + 3} \dots

The discounted return is defined as:

G_{t} ≐ R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + \dots

where $γ \in (0, 1)$ is the discount rate. $G_{t}$ is a random variable.
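As a small illustration of the definition, the sketch below computes the discounted return for one sampled reward sequence. It assumes the sequence is finite (the episode ends or the sum is truncated); the function name `discounted_return` and the sample rewards are made up for this example.

```python
def discounted_return(rewards, gamma):
    """Compute R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite list.

    `rewards` holds [R_{t+1}, R_{t+2}, ...]; truncating to a finite list is an
    assumption of this sketch.
    """
    g = 0.0
    # Work backwards so that each step applies G_k = R_{k+1} + gamma * G_{k+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequence, for illustration only.
sample_rewards = [1.0, 0.0, 2.0]
g0 = discounted_return(sample_rewards, gamma=0.5)
print(g0)  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```

A different trajectory would give a different number, which is why $G_{t}$ is treated as a random variable.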

Since $G_{t}$ is a random variable, we can calculate its expected value (also called the expectation or mean):

v_{π} (s) ≐ E [G_{t} | S_{t} = s] .

Here, $v_{π} (s)$ is called the state-value function, or simply the state value of $s$ . Some important remarks are given below.

$v_{π} (s)$ depends on $s$ . This is because its definition is a conditional expectation with the condition that the agent starts from $S_{t} = s$ .
$v_{π} (s)$ depends on $π$ . This is because the trajectories are generated by following the policy $π$ . For a different policy, the state value may be different.
$v_{π} (s)$ does not depend on $t$ . If the agent moves in the state space, $t$ represents the current time step. The value of $v_{π} (s)$ is determined once the policy is given.
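To make "expected value of the return" concrete, here is a minimal Monte Carlo sketch that averages sampled returns from one state. The setup is hypothetical: under a fixed policy, each step taken in state A ends the episode with reward 1 with probability 0.5, and otherwise stays in A with reward 0. The names `sample_return` and `mc_state_value` are made up for this example.

```python
import random

GAMMA = 0.9  # discount rate, chosen arbitrarily for this sketch


def sample_return(rng):
    """Sample one discounted return G_t starting from state A."""
    discount = 1.0
    while True:
        if rng.random() < 0.5:
            # The episode terminates with reward 1, discounted by gamma^k.
            return discount * 1.0
        # Otherwise the agent stays in A with reward 0; later rewards are discounted more.
        discount *= GAMMA


def mc_state_value(num_episodes, seed=0):
    """Estimate v_pi(A) as the sample mean of independent returns."""
    rng = random.Random(seed)
    total = sum(sample_return(rng) for _ in range(num_episodes))
    return total / num_episodes


estimate = mc_state_value(20000)
# For this chain, v = 0.5 * 1 + 0.5 * GAMMA * v, hence v = 0.5 / (1 - 0.5 * GAMMA).
exact = 0.5 / (1 - 0.5 * GAMMA)
print(estimate, exact)
```

The estimate depends on the starting state and on the policy that generates the episodes, but not on which time step the episode is said to start from, in line with the three remarks above.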

Formal definition (Eq. 3.12):

v_{π} (s) ≐ E_{π} [G_{t} ∣ S_{t} = s] = E_{π} [\sum_{k = 0}^{\infty} γ^{k} R_{t + k + 1} ∣ S_{t} = s], \forall s \in S

where $E_{π} [\cdot]$ denotes the expectation under policy $π$ . The value of a terminal state is always zero.

Similarly, the action-value function is defined as the expected return from taking action $a$ in state $s$ and following policy $π$ thereafter:

q_{π} (s, a) ≐ E_{π} [G_{t} ∣ S_{t} = s, A_{t} = a] = E_{π} [\sum_{k = 0}^{\infty} γ^{k} R_{t + k + 1} ∣ S_{t} = s, A_{t} = a]

The state value is the expectation of the action values under policy $π$ :

v_{π} (s) = \sum_{a} π (a ∣ s) q_{π} (s, a)

Conversely, the action value can be expressed in terms of the state values and the dynamics function:

q_{π} (s, a) = \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ v_{π} (s^{'})]
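The two conversions between $v_{π}$ and $q_{π}$ can be checked numerically. The sketch below assumes a hypothetical MDP with one non-terminal state `s0` and a terminal state `T`: action `stop` ends the episode with reward 1, and action `go` stays in `s0` with reward 0. The policy picks each action with probability 0.5. All names (`P`, `PI`, `q_from_v`, `v_from_q`) are made up for this example.

```python
GAMMA = 0.9

# Hypothetical dynamics p(s', r | s, a): (s, a) -> list of (probability, next_state, reward).
P = {
    ("s0", "stop"): [(1.0, "T", 1.0)],
    ("s0", "go"): [(1.0, "s0", 0.0)],
}
# Hypothetical policy pi(a | s).
PI = {"s0": {"stop": 0.5, "go": 0.5}}


def q_from_v(v, s, a):
    """q_pi(s, a) = sum over (s', r) of p(s', r | s, a) * [r + gamma * v_pi(s')]."""
    return sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in P[(s, a)])


def v_from_q(q, s):
    """v_pi(s) = sum over a of pi(a | s) * q_pi(s, a)."""
    return sum(pa * q[(s, a)] for a, pa in PI[s].items())


# For this MDP, v_pi(s0) solves v = 0.5 * 1 + 0.5 * GAMMA * v; the terminal value is 0.
v = {"s0": 0.5 / (1 - 0.5 * GAMMA), "T": 0.0}
q = {("s0", a): q_from_v(v, "s0", a) for a in PI["s0"]}
v_back = v_from_q(q, "s0")
print(q, v_back)
```

Going from $v_{π}$ to $q_{π}$ and back recovers the original value only because `v` here is the true $v_{π}$; for an arbitrary guess the round trip would instead perform one update step.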

Substituting the return recursion $G_{t} = R_{t + 1} + γ G_{t + 1}$ into the definition of $v_{π}$ (the full derivation of Eq. 3.14):

\begin{aligned} v_{π} (s) & ≐ E_{π} [G_{t} ∣ S_{t} = s] \\ & = E_{π} [R_{t + 1} + γ G_{t + 1} ∣ S_{t} = s] \quad \text{(by (3.9))} \\ & = \sum_{a} π (a | s) \sum_{s^{'}} \sum_{r} p (s^{'}, r | s, a) [r + γ E_{π} [G_{t + 1} ∣ S_{t + 1} = s^{'}]] \\ & = \sum_{a} π (a | s) \sum_{s^{'}, r} p (s^{'}, r | s, a) [r + γ v_{π} (s^{'})] \end{aligned}

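This is the Bellman equation. It expresses the value of a state recursively in terms of the values of its successor states, and it is the theoretical foundation of dynamic programming, TD learning, and MC methods.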
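One common way to use the Bellman equation is to apply its right-hand side repeatedly as an update until the values stop changing (iterative policy evaluation). The sketch below does this for a hypothetical MDP with one non-terminal state `s0` and a terminal state `T`; the dynamics `P`, the policy `PI`, and the function names are assumptions made for this example.

```python
GAMMA = 0.9

# Hypothetical dynamics p(s', r | s, a): (s, a) -> list of (probability, next_state, reward).
P = {
    ("s0", "stop"): [(1.0, "T", 1.0)],
    ("s0", "go"): [(1.0, "s0", 0.0)],
}
# Hypothetical policy pi(a | s), defined for non-terminal states only.
PI = {"s0": {"stop": 0.5, "go": 0.5}}


def bellman_backup(v, s):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    return sum(
        pa * sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in P[(s, a)])
        for a, pa in PI[s].items()
    )


def evaluate_policy(tol=1e-10, max_sweeps=10_000):
    """Sweep over all non-terminal states until the largest change is below tol."""
    v = {"s0": 0.0, "T": 0.0}  # the terminal value stays at zero
    for _ in range(max_sweeps):
        new_v = dict(v)
        for s in PI:
            new_v[s] = bellman_backup(v, s)
        delta = max(abs(new_v[s] - v[s]) for s in v)
        v = new_v
        if delta < tol:
            break
    return v


v_pi = evaluate_policy()
print(v_pi["s0"])  # approaches 0.5 / (1 - 0.5 * GAMMA)
```

For a small finite MDP the same equation could instead be solved directly as a linear system; the iterative form is shown here only because it mirrors the recursion most closely.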

The Bellman equation for $q_{π}$ is:

q_{π} (s, a) = \sum_{s^{'}, r} p (s^{'}, r | s, a) [r + γ \sum_{a^{'}} π (a^{'} | s^{'}) q_{π} (s^{'}, a^{'})]

The optimal state-value function $v_{*}$ and the optimal action-value function $q_{*}$ give the largest expected return achievable by any policy:

v_{*} (s) ≐ \max_{π} v_{π} (s), \quad q_{*} (s, a) ≐ \max_{π} q_{π} (s, a)

For a given MDP, $v_{*}$ and $q_{*}$ are unique, but there may be more than one optimal policy. The relationship between $v_{*}$ and $q_{*}$ is:

v_{*} (s) = \max_{a} q_{*} (s, a), \quad q_{*} (s, a) = E [R_{t + 1} + γ v_{*} (S_{t + 1}) ∣ S_{t} = s, A_{t} = a]

The optimal value functions satisfy the Bellman optimality equations:

v_{*} (s) = \max_{a} \sum_{s^{'}, r} p (s^{'}, r | s, a) [r + γ v_{*} (s^{'})]

q_{*} (s, a) = \sum_{s^{'}, r} p (s^{'}, r | s, a) [r + γ \max_{a^{'}} q_{*} (s^{'}, a^{'})]

For a finite MDP, the Bellman optimality equation has a unique solution. Any policy that is greedy with respect to $v_{*}$ or $q_{*}$ is an optimal policy.
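The Bellman optimality equation for $v_{*}$ can be solved approximately by applying its right-hand side as an update (value iteration), after which a greedy policy is read off with a one-step lookahead. The sketch below assumes a hypothetical deterministic MDP with states `s0`, `s1` and a terminal state `T`; the dynamics, rewards, and function names are made up for this example.

```python
GAMMA = 0.9

# Hypothetical dynamics p(s', r | s, a): (s, a) -> list of (probability, next_state, reward).
P = {
    ("s0", "stop"): [(1.0, "T", 1.0)],
    ("s0", "go"): [(1.0, "s1", 0.0)],
    ("s1", "stop"): [(1.0, "T", 2.0)],
    ("s1", "back"): [(1.0, "s0", 0.0)],
}
ACTIONS = {"s0": ["stop", "go"], "s1": ["stop", "back"]}


def q_value(v, s, a):
    """One-step lookahead: sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(prob * (r + GAMMA * v[s2]) for prob, s2, r in P[(s, a)])


def value_iteration(tol=1e-10, max_sweeps=10_000):
    """Repeat v(s) <- max_a q_value(v, s, a) until the largest change is below tol."""
    v = {"s0": 0.0, "s1": 0.0, "T": 0.0}  # the terminal value stays at zero
    for _ in range(max_sweeps):
        new_v = dict(v)
        for s, actions in ACTIONS.items():
            new_v[s] = max(q_value(v, s, a) for a in actions)
        delta = max(abs(new_v[s] - v[s]) for s in v)
        v = new_v
        if delta < tol:
            break
    return v


def greedy_policy(v):
    """Pick, in each state, an action that maximizes the one-step lookahead."""
    return {s: max(actions, key=lambda a: q_value(v, s, a)) for s, actions in ACTIONS.items()}


v_star = value_iteration()
pi_star = greedy_policy(v_star)
print(v_star, pi_star)
```

In this example the greedy policy moves from `s0` to `s1` and stops there, because the discounted reward of 2 (worth 1.8 from `s0`) exceeds the immediate reward of 1.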

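State values give a more formal way than individual returns to evaluate a policy: a policy that yields larger state values is better. The Bellman equation is the core tool for computing state values.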